{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Web Scraping using Apify Tools\n",
    "\n",
    "This notebook shows how to use Apify tools with AG2 agents to\n",
    "scrape data from a website and format the output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we need to install the Apify SDK and the AG2 library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install -qqq autogen apify-client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Setting up the LLM configuration and the Apify API key is also required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "config_list = [\n",
    "    {\"model\": \"gpt-4\", \"api_key\": os.getenv(\"OPENAI_API_KEY\")},\n",
    "]\n",
    "\n",
    "apify_api_key = os.getenv(\"APIFY_API_KEY\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define the tool for scraping data from the website using Apify actor.\n",
    "Read more about tool use in this [tutorial chapter](https://docs.ag2.ai/latest/docs/user-guide/basic-concepts/introducing-tools/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Annotated\n",
    "\n",
    "from apify_client import ApifyClient\n",
    "\n",
    "\n",
    "def scrape_page(url: Annotated[str, \"The URL of the web page to scrape\"]) -> Annotated[str, \"Scraped content\"]:\n",
    "    # Initialize the ApifyClient with your API token\n",
    "    client = ApifyClient(token=apify_api_key)\n",
    "\n",
    "    # Prepare the Actor input\n",
    "    run_input = {\n",
    "        \"startUrls\": [{\"url\": url}],\n",
    "        \"useSitemaps\": False,\n",
    "        \"crawlerType\": \"playwright:firefox\",\n",
    "        \"includeUrlGlobs\": [],\n",
    "        \"excludeUrlGlobs\": [],\n",
    "        \"ignoreCanonicalUrl\": False,\n",
    "        \"maxCrawlDepth\": 0,\n",
    "        \"maxCrawlPages\": 1,\n",
    "        \"initialConcurrency\": 0,\n",
    "        \"maxConcurrency\": 200,\n",
    "        \"initialCookies\": [],\n",
    "        \"proxyConfiguration\": {\"useApifyProxy\": True},\n",
    "        \"maxSessionRotations\": 10,\n",
    "        \"maxRequestRetries\": 5,\n",
    "        \"requestTimeoutSecs\": 60,\n",
    "        \"dynamicContentWaitSecs\": 10,\n",
    "        \"maxScrollHeightPixels\": 5000,\n",
    "        \"removeElementsCssSelector\": \"\"\"nav, footer, script, style, noscript, svg,\n",
    "    [role=\\\"alert\\\"],\n",
    "    [role=\\\"banner\\\"],\n",
    "    [role=\\\"dialog\\\"],\n",
    "    [role=\\\"alertdialog\\\"],\n",
    "    [role=\\\"region\\\"][aria-label*=\\\"skip\\\" i],\n",
    "    [aria-modal=\\\"true\\\"]\"\"\",\n",
    "        \"removeCookieWarnings\": True,\n",
    "        \"clickElementsCssSelector\": '[aria-expanded=\"false\"]',\n",
    "        \"htmlTransformer\": \"readableText\",\n",
    "        \"readableTextCharThreshold\": 100,\n",
    "        \"aggressivePrune\": False,\n",
    "        \"debugMode\": True,\n",
    "        \"debugLog\": True,\n",
    "        \"saveHtml\": True,\n",
    "        \"saveMarkdown\": True,\n",
    "        \"saveFiles\": False,\n",
    "        \"saveScreenshots\": False,\n",
    "        \"maxResults\": 9999999,\n",
    "        \"clientSideMinChangePercentage\": 15,\n",
    "        \"renderingTypeDetectionPercentage\": 10,\n",
    "    }\n",
    "\n",
    "    # Run the Actor and wait for it to finish\n",
    "    run = client.actor(\"aYG0l9s7dbB7j3gbS\").call(run_input=run_input)\n",
    "\n",
    "    # Fetch and print Actor results from the run's dataset (if there are any)\n",
    "    text_data = \"\"\n",
    "    for item in client.dataset(run[\"defaultDatasetId\"]).iterate_items():\n",
    "        text_data += item.get(\"text\", \"\") + \"\\n\"\n",
    "\n",
    "    average_token = 0.75\n",
    "    max_tokens = 20000  # slightly less than max to be safe 32k\n",
    "    text_data = text_data[: int(average_token * max_tokens)]\n",
    "    return text_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the agents and register the tool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogen import ConversableAgent, register_function\n",
    "\n",
    "# Create web scrapper agent.\n",
    "scraper_agent = ConversableAgent(\n",
    "    \"WebScraper\",\n",
    "    llm_config={\"config_list\": config_list},\n",
    "    system_message=\"You are a web scrapper and you can scrape any web page using the tools provided. \"\n",
    "    \"Returns 'TERMINATE' when the scraping is done.\",\n",
    ")\n",
    "\n",
    "# Create user proxy agent.\n",
    "user_proxy_agent = ConversableAgent(\n",
    "    \"UserProxy\",\n",
    "    llm_config=False,  # No LLM for this agent.\n",
    "    human_input_mode=\"NEVER\",\n",
    "    code_execution_config=False,  # No code execution for this agent.\n",
    "    is_termination_msg=lambda x: x.get(\"content\", \"\") is not None and \"terminate\" in x[\"content\"].lower(),\n",
    "    default_auto_reply=\"Please continue if not finished, otherwise return 'TERMINATE'.\",\n",
    ")\n",
    "\n",
    "# Register the function with the agents.\n",
    "register_function(\n",
    "    scrape_page,\n",
    "    caller=scraper_agent,\n",
    "    executor=user_proxy_agent,\n",
    "    name=\"scrape_page\",\n",
    "    description=\"Scrape a web page and return the content.\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Start the conversation for scraping web data. We used the\n",
    "`reflection_with_llm` option for summary method\n",
    "to perform the formatting of the output into a desired format.\n",
    "The summary method is called after the conversation is completed\n",
    "given the complete history of the conversation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chat_result = user_proxy_agent.initiate_chat(\n",
    "    scraper_agent,\n",
    "    message=\"Can you scrape agentops.ai for me?\",\n",
    "    summary_method=\"reflection_with_llm\",\n",
    "    summary_args={\n",
    "        \"summary_prompt\": \"\"\"Summarize the scraped content and format summary EXACTLY as follows:\n",
    "---\n",
    "*Company name*:\n",
    "`Acme Corp`\n",
    "---\n",
    "*Website*:\n",
    "`acmecorp.com`\n",
    "---\n",
    "*Description*:\n",
    "`Company that does things.`\n",
    "---\n",
    "*Tags*:\n",
    "`Manufacturing. Retail. E-commerce.`\n",
    "---\n",
    "*Takeaways*:\n",
    "`Provides shareholders with value by selling products.`\n",
    "---\n",
    "*Questions*:\n",
    "`What products do they sell? How do they make money? What is their market share?`\n",
    "---\n",
    "\"\"\"\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output is stored in the summary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(chat_result.summary)"
   ]
  }
 ],
 "metadata": {
  "front_matter": {
   "description": "Scrapping web pages and summarizing the content using agents with tools.",
   "tags": [
    "web",
    "apify",
    "integration",
    "tool/function"
   ],
   "title": "Web Scraper Agent using Apify Tools"
  },
  "kernelspec": {
   "display_name": "autogen",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
